Lecture 02

Bill Perry

Lecture 2: Review

  • We covered inductive vs deductive reasoning
  • How to begin to ask questions
  • Accuracy and precision
  • What are general types of data
  • How to set up an R project in Rstudio
  • How to install and load libraries
  • How to read a file into R
  • How to make a graph

Our first graph

Lecture 2: How to deal with data wrangling

  • Data management overview
  • How to make a tidy spreadsheet
  • Metadata - why you really should use it
  • Data repositories
  • R in practice

Lecture 2: Data management overview

  • Data: the raw material of science
  • Wide variety of formats, sizes, complexity
  • Data management and curation often underemphasized
  • Good data management: owe it to our funders, colleagues, supervisors, and study systems

Lecture 2: Data gathering - managing

Step 1

  1. Decide what type of data you are collecting
  2. Decide on a controlled vocabulary - odo_mgl, drp_ugl
  3. Decide what has to happen in the data flow
  4. Organize your project
  5. Enter the data as soon as you can
    1. in a spreadsheet, saved as Excel and csv
    2. you really need to be sure it is tidy
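
A controlled vocabulary can be checked by code. The sketch below is one possible way to do it in base R: the list of allowed names and the small data frame are made up for illustration, and only odo_mgl and drp_ugl come from the slide.

```r
# Hypothetical controlled vocabulary: the only column names allowed in this project
allowed_names <- c("site", "date", "odo_mgl", "drp_ugl")

# Made-up data frame; "DO_mgL" breaks the vocabulary on purpose
field_df <- data.frame(
  site = c("inflow", "outflow"),
  date = c("2025-02-01", "2025-02-01"),
  DO_mgL = c(8.2, 7.9),
  drp_ugl = c(12, 15)
)

# Column names in the data that are not in the vocabulary
bad_names <- setdiff(names(field_df), allowed_names)
print(bad_names)
```

Running a check like this right after data entry catches naming drift before it spreads into scripts.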

Tidy data by Wickham

Step 2:

  1. Make a metadata sheet
    1. data about data
    2. descriptions, units, etc.

Lecture 2: Data gathering - managing

Step 3

  1. Store your raw data and metadata
  2. Electronic dataframes should be stored in ≥3 copies:
    1. Your computer (onsite)
    2. External storage (onsite)
    3. Offsite storage (e.g., cloud storage)
  3. Have a regular backup strategy
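
A backup routine can be scripted. This is a minimal sketch in base R, assuming two hypothetical backup folders that stand in for an external drive and a synced cloud folder; everything is written under tempdir() so the example does not touch real files.

```r
# Stand-in for a raw data file
raw_path <- file.path(tempdir(), "raw_data.csv")
write.csv(data.frame(site = "inflow", odo_mgl = 8.2), raw_path, row.names = FALSE)

# Hypothetical backup locations (assumed names, not from the lecture)
backup_dirs <- file.path(tempdir(), c("external_drive", "cloud_folder"))
backup_paths <- file.path(backup_dirs, basename(raw_path))

# Copy the file to each location, one copy at a time
for (i in seq_along(backup_dirs)) {
  dir.create(backup_dirs[i], showWarnings = FALSE)
  file.copy(raw_path, backup_paths[i], overwrite = TRUE)
}

# Compare checksums so a corrupted copy is caught early
checksums <- unname(tools::md5sum(c(raw_path, backup_paths)))
all_match <- length(unique(checksums)) == 1
print(all_match)
```

The checksum comparison is an extra step beyond what the slide asks for; it only confirms that the three copies are identical at the time of the backup.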

Lecture 2: Data gathering - managing

Step 4

  1. Graph your data and check outliers, errors, missing data
  2. You can code missing values as NA or as a blank cell; opinions differ
    1. you can set this when you import data
# Specifying NA values explicitly with readr
library(readr)
data <- read_csv("your_file.csv",
    na = c("", "NA", "N/A", "missing", "null"))

Lecture 2: Data gathering - managing

Step 5

Cleaning data

  1. Correct errors, fill missing data with “NA”, resolve outliers
  2. Save a clean data file as the master file - often good to make it read-only
  3. Add information in a notes column or text file about what was done and why
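
Saving the cleaned master file and locking it can be done from R. The sketch below uses base R only; the data are made up, and the file is written under tempdir() as an assumption so the example is safe to run anywhere.

```r
# Made-up cleaned data; in practice this comes from your checked raw data
clean_df <- data.frame(site = c("inflow", "outflow"), odo_mgl = c(8.2, NA))

# Hypothetical output folder inside a temporary directory
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
clean_path <- file.path(out_dir, "cleaned_raw_data.csv")

# Write missing values explicitly as NA and leave out row names
write.csv(clean_df, clean_path, row.names = FALSE, na = "NA")

# Make the file read-only (on Windows only the write permission is affected)
Sys.chmod(clean_path, mode = "0444")

# Reading the file back still works; only writing is blocked
check_df <- read.csv(clean_path)
print(check_df)
```

To overwrite the master file later, write permission has to be restored first, which is the point: changing it becomes a deliberate act.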

Lecture 2: Data gathering - managing

Step 6

  1. Time to graph the data and explore, summarize, and transform data
  2. If cleaning, transformations, and calculations take many steps, save the results to a new output file.

A good way to organize script files is to number them in the order they get run.
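
Numbered script names also make it easy to run a whole project in order. The sketch below is an illustration, not a prescribed workflow: the three script names and their one-line contents are hypothetical, and they are created in tempdir() so the example is self-contained.

```r
script_dir <- file.path(tempdir(), "scripts")
dir.create(script_dir, showWarnings = FALSE)

# Written out of order on purpose; the number prefix fixes the run order
writeLines('step_log <- c(step_log, "plot")',   file.path(script_dir, "03_plot.R"))
writeLines('step_log <- c(step_log, "import")', file.path(script_dir, "01_import.R"))
writeLines('step_log <- c(step_log, "clean")',  file.path(script_dir, "02_clean.R"))

# list.files() returns names in alphabetical order, so zero-padded numbers sort in run order
scripts <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)

# Run every script in one shared environment so later scripts see earlier results
run_env <- new.env()
run_env$step_log <- character(0)
for (script in scripts) {
  source(script, local = run_env)
}
print(run_env$step_log)
```

Zero-padding (01, 02, ... 10) matters because names are sorted as text, so "10_x.R" would otherwise sort before "2_x.R".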

Lecture 2: Data gathering - managing

The important considerations in data

  1. enter field/lab data into electronic format as soon as possible and back it up in at least one location (e.g., cloud storage)
  2. do not modify raw data in any way following entry into electronic format
  3. store all data in an open-access format (e.g., .csv)
  4. thoroughly check and clean your raw data, saving it as a separate file (e.g., “output/cleaned_raw_data.csv”)
  5. accompany raw field/lab data with metadata that is unambiguously linked to the raw data file
  6. carry out all analyses, calculations and visualization on a separate file from the “raw” or “clean” data: the “analysis” data
  7. perform all data transformation, analysis and visualization with reproducible code, and store the code together with the data
  8. arrange all raw and analysis data in “instance-row, variable-column” or tidy format: one column per variable

USE CONTROLLED VOCABULARY AND BE CONSISTENT

THINK BEFORE DOING -> WHAT HAPPENS DOWN THE ROAD
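
To make the "instance-row, variable-column" idea concrete, here is a small sketch that reshapes an untidy layout into a tidy one. The data are made up, and base R is used so it runs without extra packages; tidyr::pivot_longer() does the same job in the tidyverse.

```r
# Made-up untidy layout: one column per sampling date, so "date" is hidden in the headers
wide_df <- data.frame(
  site = c("inflow", "outflow"),
  `2025-02-01` = c(8.2, 7.9),
  `2025-03-01` = c(9.1, 8.4),
  check.names = FALSE
)

date_cols <- setdiff(names(wide_df), "site")

# Stack the date columns: one row per site-date instance, one column per variable
tidy_df <- data.frame(
  site = rep(wide_df$site, times = length(date_cols)),
  date = rep(date_cols, each = nrow(wide_df)),
  odo_mgl = unlist(wide_df[date_cols], use.names = FALSE)
)
print(tidy_df)
```

The tidy version has three columns (site, date, odo_mgl) and four rows, one for each measurement.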

Lecture 2: Data gathering - managing

Broman KW, Woo KH. 2018. Data organization in spreadsheets. The American Statistician 72: 2-10

  • Spreadsheets break data - use with extreme caution
  • Spreadsheets: data entry and storage
  • R: visualization and analysis
  • Goal: organize data so readable by humans and computers

Lecture 2: Data gathering - managing

Be consistent!

  • Codes for categorical variables
  • Variable names
    • use snake case and lower case - nitrate_n_mgl
    • always use the same name
  • Codes for missing values - NA or 9999 or a blank - I know, but I do it
  • Date formats
    • YYYY-MM-DD HH:MM:SS
    • R counts time from 1970-01-01
  • Names of objects
    • dataframes after import - data_df
    • plots - len_wt_plot
    • models - anova_wt_model
  • File names
    • use separators - 2025_02_01_lake_x_inflow.csv
  • Note format - requires considerable foresight and organization
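
The date format and the 1970-01-01 starting point can both be seen directly in R. This is a short base R sketch; the timestamp is made up and the UTC time zone is an assumption, so substitute the time zone of your study.

```r
# Parse a YYYY-MM-DD HH:MM:SS timestamp; UTC is an assumption
sample_time <- as.POSIXct("2025-02-01 14:30:00", tz = "UTC")

# Dates are stored as days since 1970-01-01, date-times as seconds since then
epoch_days <- as.numeric(as.Date("1970-01-01"))
one_day_seconds <- as.numeric(as.POSIXct("1970-01-02 00:00:00", tz = "UTC"))

# format() writes the timestamp back out in the same unambiguous layout
formatted <- format(sample_time, "%Y-%m-%d %H:%M:%S")

print(epoch_days)       # 0
print(one_day_seconds)  # 86400
print(formatted)        # "2025-02-01 14:30:00"
```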

example of fish data

Lecture 2: Data gathering - managing

Variable and file names can be a problem

  • Avoid spaces; use underscores _ instead
  • Avoid special characters such as @#$%^
  • Use more than one kind of separator (for example _ and -) so you can split the name into parts later
    • or use the same number of characters across a variable name
    • 2025_03_04_file-site
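
Consistent separators pay off when the pieces of a file name are needed as data. The sketch below splits the example name from the slide with base R; the .csv extension is an assumption added for illustration.

```r
# Name from the slide, with an assumed .csv extension
file_name <- "2025_03_04_file-site.csv"

# Drop the extension, then split on "-" first and "_" second
stem <- tools::file_path_sans_ext(file_name)
parts <- strsplit(stem, "-", fixed = TRUE)[[1]]
site <- parts[2]
date_parts <- strsplit(parts[1], "_", fixed = TRUE)[[1]]

# The first three pieces are year, month, and day
sample_date <- as.Date(paste(date_parts[1:3], collapse = "-"))

print(site)
print(sample_date)
```

Because "-" and "_" mark different levels of the name, each split is unambiguous and no position counting is needed.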

Lecture 2: Data gathering - managing

Excel will drive you mad

  • it will mess up your dates
  • store data in separate columns - year - month - day
  • or use a string 20250401
  • always use an unambiguous format, largest to smallest unit - why?
    • is 01 04 2025 the same as 04 01 2025?
    • what are the dates in English?
    • or European?
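
The ambiguity is easy to demonstrate. In this base R sketch the same text is parsed once as day-month-year and once as month-day-year; which reading a given collaborator intends is exactly what the file cannot tell you.

```r
ambiguous <- "01 04 2025"

# Day-first reading
as_dmy <- as.Date(ambiguous, format = "%d %m %Y")
# Month-first reading
as_mdy <- as.Date(ambiguous, format = "%m %d %Y")

print(format(as_dmy))  # "2025-04-01"
print(format(as_mdy))  # "2025-01-04"

# Year-month-day has only one reading
iso <- as.Date("2025-04-01")

# Largest-to-smallest strings also sort in time order as plain text
sorted_names <- sort(c("20250401", "20241231", "20250104"))
print(sorted_names)
```

The last two lines give one answer to the "why?" on the slide: year-first formats sort chronologically even when treated as text, which file browsers and spreadsheets do.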

Lecture 2: Data gathering - managing

Never do calculations in Excel

  • always do calculations in R - reproducible
  • never merge cells
  • you can use highlighting, but it is lost when the file is saved as csv
  • a nice rectangular dataframe will make you happy
    • tears will flow if not
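
As a small illustration of a calculation done in R rather than in a spreadsheet cell, the sketch below converts a concentration from micrograms to milligrams per litre. The values are made up and the column names follow the lecture's naming style.

```r
# Made-up values
water_df <- data.frame(site = c("inflow", "outflow"), drp_ugl = c(12, 15))

# The calculation is written down once, so it can be rerun and checked
water_df$drp_mgl <- water_df$drp_ugl / 1000

print(water_df)
```

The original column stays untouched and the new column records its units in its name, so anyone reading the script can see what was done.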

Lecture 2: Data gathering - managing

Metadata

  • This data will live beyond you
  • Someone will need to interpret it - what do they need?
    • What the data is about
    • Who collected it
    • When
    • Where
    • Funding agency
    • Methods used to collect
    • Variable names
      • description
      • units
      • abbreviations
    • CALCULATIONS AND WHY?

We need to know what happened and why.
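
A metadata sheet can be kept as a small table with one row per variable. The sketch below is only an illustration: the descriptions are guesses at what the lecture's abbreviations mean, and the data frame is made up. The last step checks that every column in the data has a metadata entry.

```r
# Made-up metadata sheet; descriptions are assumed, not taken from the lecture
metadata_df <- data.frame(
  variable = c("site", "date", "odo_mgl", "drp_ugl"),
  description = c("sampling site", "sampling date",
                  "dissolved oxygen (assumed meaning)",
                  "dissolved reactive phosphorus (assumed meaning)"),
  units = c(NA, "YYYY-MM-DD", "mg/L", "ug/L")
)

# Made-up data file contents
data_df <- data.frame(site = "inflow", date = "2025-02-01", odo_mgl = 8.2, drp_ugl = 12)

# Columns in the data that have no row in the metadata sheet
undocumented <- setdiff(names(data_df), metadata_df$variable)
print(undocumented)
```

An empty result means every variable is documented; anything listed needs a description and units before the data are shared.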